A note on using the F-measure for evaluating data linkage algorithms

نویسندگان

  • David Hand
  • Peter Christen
چکیده

Record linkage is the process of identifying and linking records about the same entities from one or more databases. Record linkage can be viewed as a classification problem where the aim is to decide if a pair of records is a match (i.e. two records refer to the same real-world entity) or a non-match (two records refer to two different entities). Various classification techniques — including supervised, unsupervised, semi-supervised and active learning based — have been employed for record linkage. If ground truth data in the form of known true matches and non-matches are available, the quality of classified links can be evaluated. Due to the generally high class imbalance in record linkage problems, standard accuracy or misclassification rate are not meaningful for assessing the quality of a set of linked records. Instead, precision and recall, as commonly used in information retrieval and machine learning, are used. These are often combined into the popular F-measure, which is the harmonic mean of precision and recall. We show that the F-measure can also be expressed as a weighted sum of precision and recall, with weights which depend on the linkage method being used. This reformulation reveals that the F-measure has a major conceptual weakness: the relative importance assigned to precision and recall should be an aspect of the problem and the researcher or user, but not of the particular linkage method being used. We suggest alternative measures which do not suffer from this fundamental flaw.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparison of Four Data Mining Algorithms for Predicting Colorectal Cancer Risk

Background and Objective: Colorectal cancer (CRC) is one of the most prevalent malignancies in the world. The early detection of CRC is not only a simple process, but it is also the key to its treatment. Given that data mining algorithms could be potentially useful in cancer prognosis, diagnosis, and treatment, the main focus of this study is to measure the performance of some data mining class...

متن کامل

A note on decision making in medical investigations using new divergence measures for intuitionistic fuzzy sets

Srivastava and Maheshwari (Iranian Journal of Fuzzy Systems 13(1)(2016) 25-44) introduced a new divergence measure for intuitionisticfuzzy sets (IFSs). The properties of the proposed divergence measurewere studied and the efficiency of the proposed divergence measurein the context of medical diagnosis was also demonstrated. In thisnote, we point out some errors in ...

متن کامل

Choosing the Best Hierarchical Clustering Technique Based on Principal Components Analysis for Suspended Sediment Load Estimation

1- INTRODUCTION The assessment of watershed sediment load is necessary for controling soil erosion and reducing the potential of sediment production. Different estimates of sediment amounts along with the lack of long-term measurements limits the accessibility to reliable data series of erosion rate and sediment yield. Therefore, the observed data of suspended sediment load could be used to ...

متن کامل

Appling Metaheuristic Algorithms on a Two Stage Hybrid Flowshop Scheduling Problem with Serial Batching (RESEARCH NOTE)

In this paper the problem of serial batch scheduling in a two-stage hybrid flow shop environment with minimizing Makesapn is investigated. In serial batching it is assumed that jobs in a batch are processed serially, and their completion time is defined to be equal to the finishing time of the last job in the batch. The analysis and implementation of the prohibited transference of jobs among th...

متن کامل

Prediction and evaluation of runoff data in south of Qazvin watershed, using a fuzzy logic technique

The important criteria for designing in the most of hydrologic and hydraulic construction projects are based on runoff or peak-flow of water. Mostly, this measure and criterion is calculated or estimated by stochastic data. Another feature of these data that are used in watershed hydrological studies is their impreciseness. Therefore, in this study, in order to deal with uncertainty and impreci...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016